This workshop will cover a set of statistical and data analysis skills using the R programming language over two days.
Day 1:
Day 2:
To learn fundamental R coding and how to create a reproducible report using R Markdown, then to learn how to source and wrangle data using R in the context of a research question.
R is mainly used as an object-oriented statistical coding language. This means that we read in or create objects and then apply functions to them. Related functions are organized into packages, and these user-developed packages are essential to the success of R as a coding language. At its most fundamental, R is a sophisticated calculator. For example:
2+2
## [1] 4
Or, if we create a vector (object) using c() and assign it the name new_list using <-, we can perform the same operation on every element:
new_list <- c(2,3,4,5,6,4,5,6,7,7,8)
new_list + 5
## [1] 7 8 9 10 11 9 10 11 12 12 13
We can calculate basic statistics on new_list using built-in R functions. In this code chunk we calculate the mean of the vector using the base package. The source package is specified with the package::specific_function notation.
base::mean(new_list)
## [1] 5.181818
stats::median(new_list)
## [1] 5
base::summary(new_list)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 4.000 5.000 5.182 6.500 8.000
Individuals such as Hadley Wickham have built sets of packages designed to work well together. One of the most popular collections is called the tidyverse. During this workshop we will utilize several packages from the tidyverse to import and wrangle our data.
First we must install and load the package for use.
install.packages("tidyverse")
The function library() is the most common way to load packages: it always attaches the installed ("latest and greatest") version of the package each time the document code is executed, and it throws an error immediately if a package is missing, which makes problems easy to catch in documents that are executed often. The related function require() returns FALSE instead of erroring, which makes it better suited to conditional loading than to reports.
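A minimal sketch of the difference between the two loaders (using tidyverse purely as the example package):

```r
# library() stops with an error if the package is not installed,
# so a missing dependency surfaces the moment the report is knitted:
library(tidyverse)

# require() returns FALSE (with a warning) instead of erroring,
# which makes it useful for conditional installation-then-loading:
if (!require("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
  library(tidyverse)
}
```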
library(tidyverse)
Using the readr package from the tidyverse, we can import data files with various extensions such as csv or txt. Relative file paths such as "./within_project_file_path" can be used when the data exist inside the current working directory, which is usually the same as the project directory. Here we will import data from Airbnb.
airbnb_data <- readr::read_csv("./data/airbnb_listings.csv")
Now that we have read in the data, let us preview the first 10 rows. In this code chunk you will notice the symbol %>% at the end of each line of code, which represents piping. As we wrangle our objects we simply pass them from one operation to the next, getting a preview at each step before saving the finished, wrangled object if necessary.
airbnb_data %>%
utils::head(n = 10) %>%
knitr::kable() %>%
kableExtra::kable_styling() %>%
kableExtra::scroll_box(width = "100%", height = "200px")
| id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | number_of_reviews_ltm | license |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75356 | -73.98559 | Entire home/apt | NA | 30 | 49 | 2022-06-21 | 0.28 | 3 | 300 | 0 | NA |
| 5136 | Spacious Family Friendly Duplex w/ Patio + Yard | 7378 | Rebecca | Brooklyn | Sunset Park | 40.66265 | -73.99454 | Entire home/apt | 215 | 30 | 4 | 2023-08-20 | 0.03 | 1 | 71 | 1 | NA |
| 6848 | Only 2 stops to Manhattan studio | 15991 | Allen & Irina | Brooklyn | Williamsburg | 40.70935 | -73.95342 | Entire home/apt | 81 | 30 | 193 | 2024-05-18 | 1.05 | 1 | 193 | 3 | NA |
| 6872 | Uptown Sanctuary w/ Private Bath (Month to Month) | 16104 | Kae | Manhattan | East Harlem | 40.80107 | -73.94255 | Private room | 65 | 30 | 1 | 2022-06-05 | 0.04 | 2 | 365 | 0 | NA |
| 6990 | UES Beautiful Blue Room | 16800 | Cyn | Manhattan | East Harlem | 40.78778 | -73.94759 | Private room | 65 | 30 | 247 | 2024-03-06 | 1.38 | 1 | 212 | 2 | NA |
| 7064 | Amazing location! Wburg. Large, bright & tranquil | 17297 | Joelle | Brooklyn | Williamsburg | 40.71248 | -73.95881 | Private room | NA | 30 | 13 | 2022-09-12 | 0.08 | 2 | 0 | 0 | NA |
| 7097 | Perfect for Your Parents, With Garden & Patio | 17571 | Jane | Brooklyn | Fort Greene | 40.69194 | -73.97389 | Private room | 205 | 2 | 374 | 2024-06-02 | 2.12 | 2 | 219 | 36 | OSE-STRREG-0000008 |
| 7801 | Sunny Williamsburg Loft with Sauna | 21207 | Chaya | Brooklyn | Williamsburg | 40.71881 | -73.95618 | Entire home/apt | 290 | 30 | 12 | 2023-10-31 | 0.07 | 1 | 219 | 2 | NA |
| 8490 | Maison des Sirenes1,bohemian, luminous apartment | 25183 | Nathalie | Brooklyn | Bedford-Stuyvesant | 40.68456 | -73.93963 | Entire home/apt | 170 | 30 | 190 | 2023-10-16 | 1.05 | 2 | 215 | 7 | NA |
| 9357 | Midtown Pied-a-terre | 30193 | Tommi | Manhattan | Hell’s Kitchen | 40.76724 | -73.98664 | Entire home/apt | 175 | 30 | 58 | 2017-08-13 | 0.32 | 1 | 281 | 0 | NA |
The knitr package makes rendering to html simple and clean; although it is not part of the tidyverse itself, it integrates well with tidyverse workflows. Packages like kableExtra are support packages that add extra formatting for html rendering.
Let’s say we wanted to select a few variables that could help us answer the question: what is the most expensive neighborhood for Airbnb listings by room type?
The dplyr package has a set of functions that allow us
to effectively wrangle data.
airbnb_data %>%
dplyr::select(neighbourhood, room_type, price) %>%
dplyr::group_by(neighbourhood, room_type) %>%
dplyr::summarize(avg_price = mean(price, na.rm = T)) %>%
dplyr::ungroup() %>%
dplyr::group_by(room_type) %>%
dplyr::filter(avg_price == max(avg_price, na.rm = T)) %>%
dplyr::arrange(-avg_price) %>%
knitr::kable() %>%
kableExtra::kable_styling() %>%
kableExtra::scroll_box(width = "100%", height = "200px")
| neighbourhood | room_type | avg_price |
|---|---|---|
| SoHo | Private room | 2956.045 |
| Todt Hill | Entire home/apt | 1070.500 |
| West Village | Shared room | 1000.000 |
| Washington Heights | Hotel room | 974.000 |
Let’s reshape our data set and create new columns. This time we are interested in neighborhoods where the average private room price is lower than the average price for an entire home.
airbnb_data %>%
dplyr::select(neighbourhood, room_type, price) %>%
dplyr::group_by(neighbourhood, room_type) %>%
dplyr::summarize(avg_price = mean(price, na.rm = T)) %>%
dplyr::ungroup() %>%
tidyr::pivot_wider(
names_from = room_type,
values_from = avg_price
) %>%
janitor::clean_names() %>%
dplyr::mutate(ind1 = dplyr::if_else(private_room < entire_home_apt, 1, 0),
diff = entire_home_apt - private_room) %>%
dplyr::filter(ind1 == 1,
diff <= 100
) %>%
arrange(diff) %>%
utils::head(n = 10) %>%
knitr::kable() %>%
kableExtra::kable_styling() %>%
kableExtra::scroll_box(width = "100%", height = "200px")
| neighbourhood | entire_home_apt | private_room | shared_room | hotel_room | ind1 | diff |
|---|---|---|---|---|---|---|
| Graniteville | 102.00000 | 95.50000 | 99.00000 | NA | 1 | 6.500000 |
| Midtown | 388.56869 | 380.17936 | 106.83333 | 582.1358 | 1 | 8.389333 |
| Mount Eden | 122.50000 | 111.50000 | NA | NA | 1 | 11.000000 |
| Stuyvesant Town | 237.22727 | 221.14286 | NaN | NA | 1 | 16.084416 |
| Dyker Heights | 119.16667 | 102.66667 | NA | NA | 1 | 16.500000 |
| Kew Gardens Hills | 208.07692 | 189.33333 | NA | NA | 1 | 18.743590 |
| Unionport | 166.85714 | 148.00000 | NA | NA | 1 | 18.857143 |
| Fieldston | 101.33333 | 80.80000 | NA | NA | 1 | 20.533333 |
| Mariners Harbor | 95.44444 | 73.66667 | 129.50000 | NA | 1 | 21.777778 |
| Bedford-Stuyvesant | 218.67332 | 194.77198 | 75.18182 | NA | 1 | 23.901340 |
Now let’s create a plot of prices by room type and county.
airbnb_data %>%
dplyr::filter(!is.na(price)) %>%
ggplot(aes(x = neighbourhood_group,y = log(price),
fill = room_type))+
geom_boxplot()+
labs(x = "county", y = "log avg price")
When using R Markdown there are three areas of the document to consider. (i) The YAML header controls what the knitted document will look like and contains information such as the title, date, and output format of the document.
---
title: "Hampton University Workshop Guidebook"
output:
html_document:
toc: true
toc_float: true
---
Afterwards, (ii) the empty document space can be used for html or css code, though most often it contains plain text; for example, every paragraph in this document is written in this empty space.
Lastly, (iii) code chunks contain calculations that are closely related to each other, for example reading and viewing data, or creating multiple plots.
# comments can still be made in code chunks as long as they are preceded by a '#' sign
R code goes here
Here’s a template that can be used to start all reports.
# global default settings for chunks
knitr::opts_chunk$set( eval = T, warning = FALSE, message = FALSE, echo = T,
fig.dim = c(6, 3),
fig.width = 9,
fig.height = 5,
dpi = 300,
fig.pos = "!h"
)
# Load all packages for project
Packages <- c("tidyverse","patchwork")
invisible(lapply(Packages, library, character.only = TRUE))
#theme global setting for ggplot
theme_set(theme_minimal() +
theme(legend.position = "bottom") +
theme(plot.title = element_text(hjust = 0.5, size = 12),
plot.subtitle = element_text(hjust = 0.5, size = 8))
)
Now that the packages are loaded, there is no need to specify which package a function is called from.
If interested: Here is how to make an interactive geospatial plot
using the leaflet package.
pall <- leaflet::colorFactor(
palette = "viridis",
domain = airbnb_data$room_type %>% factor())
airbnb_data %>%
rename("boro" = "neighbourhood_group") %>%
filter(boro %in% c("Brooklyn","Manhattan","Queens")) %>%
na.omit(price) %>%
sample_n(1000) %>%
mutate(
click_label =
str_c("<b>$", price, "</b><br>",
room_type, "<br>", number_of_reviews, " reviews")) %>%
leaflet::leaflet(width = "100%", height = 600) %>%
leaflet::addProviderTiles(leaflet::providers$CartoDB.Positron) %>%
leaflet::addCircleMarkers(lat = ~latitude, lng = ~longitude,
radius = 5,
color = ~pall(room_type),
popup = ~click_label,
stroke = NA) %>%
leaflet::addLegend("bottomright",
pal = pall,
values = ~room_type,
title = "Room Type",
opacity = 1)
Research question: Measuring pharmacy access and assessing its association with other neighborhood-level variables.
Access is a term that refers to the multiple dimensions of how one fits with a health system, as defined by Penchansky and Thomas in 1981. Aspects of access had been investigated long before, but this paper proposed a taxonomic definition describing five dimensions of access: accessibility, availability, affordability, acceptability, and accommodation.
Accessibility speaks to one's proximity to an amenity,
Availability to volume,
Affordability to health insurance and income,
Acceptability to social norms when providing services, and
Accommodation to the appropriateness of utilizing a service.
Differential access to amenities continues to be a landmark topic in the disparities discourse. Amenities such as parks, food, and healthcare are among the most studied with respect to built environments such as neighborhoods (black2010neighborhoods, neckerman2009disparities), and existing disparities are exacerbated by growing demand during crises (lopez2020parks, boz2024investigating).
In the context of pharmacy access, Guadamuz et al. found that pharmacy closures between 2007 and 2015 occurred mostly in minority neighborhoods in the most urban US cities (Guadamuz2021). Given that pharmacies serve as an extension of the US healthcare system (DiasFernandes2022), a pharmacy is regarded as an important neighborhood amenity, not only for medicine but also for food, counseling, and other services that physicians may not be able to provide.
The most popular way to measure pharmacy access is with summary statistics of proximity measures, followed by density measures, and then more sophisticated geospatial smoothing (DiasFernandes2022).
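To give a flavor of the proximity approach, here is a small sketch using the sf package; the coordinates are purely illustrative (not from our data), and the "homes" could just as well be tract centroids:

```r
library(sf)

# Hypothetical point locations (longitude/latitude), for illustration only
homes <- st_as_sf(
  data.frame(lon = c(-77.43, -77.46), lat = c(37.54, 37.55)),
  coords = c("lon", "lat"), crs = 4326
)
pharmacies <- st_as_sf(
  data.frame(lon = c(-77.44, -77.50), lat = c(37.53, 37.60)),
  coords = c("lon", "lat"), crs = 4326
)

# Pairwise distances in meters (great-circle distances, since crs = 4326)
d <- st_distance(homes, pharmacies)

# Proximity-based access measure: distance to the nearest pharmacy
apply(d, 1, min)
```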
Given the limited data available, we will use pharmacy count and pharmacy density from the National Neighborhood Data Archive (NaNDA) to summarize pharmacy access in the state of Virginia. On Day 2, during the interactive session, we will pick other neighborhood-level variables from the American Community Survey 2017 5-year estimates to test their association with pharmacy access in 2017.
Variables of interest will be decided on Day 2, once we have generated a hypothesis. Here is an example set:
Pharmacy count (primary outcome)
Population count
Race/ethnicity
Median household income
Household size
Rent burden
Median cost of owned home
Mobility
In this section you will learn how to source neighborhood variables from the American Community Survey (ACS). A Census API key is needed in order to pull ACS data through the API. Pharmacy data can be downloaded directly from the National Neighborhood Data Archive (NaNDA).
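For example, a key can be stored once and reused; the key string below is a placeholder. Note that tidycensus stores its key under its own CENSUS_API_KEY environment variable, while the pull code later in this section reads a variable named census_key, so both options are shown:

```r
# One-time setup: tidycensus can write the key to .Renviron for reuse
# across sessions
tidycensus::census_api_key("YOUR_KEY_HERE", install = TRUE)

# Alternatively, set the variable that this document's pull code reads,
# for the current session only
Sys.setenv(census_key = "YOUR_KEY_HERE")
Sys.getenv("census_key")  # confirm it is set
```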
acs_vars <- c(
# population
population = "B01001_001",
#Social economic measures
median_hh_inc ="B19013_001",
median_value_owner = "B25097_002",
below_hs = "B12001_002",
college = "B12001_004",
edu_tot = "B12001_001",
age_over_65 = "B01001_005",
age_tot = "B01001_001",
unemployment = "B23025_005",
une_tot = "B23025_003",
avg_hh_size = "B25010_001",
#Mobility measures
drive = "B08301_002",
public_trans = "B08301_010",
walked = "B08301_019",
work_home = "B08301_021",
total_mobile = "B08301_001", # used to calculate pct
# race ethnicity measures
non_hisp_white = "B03002_003",
non_hisp_black = "B03002_004",
hisp = "B03002_012",
non_hisp_asian = "B03002_006",
tot_pop = "B03002_001"
)
acs_check1 <- str_detect(acs_vars, '_0[0-9][0-9]$')
if (any(acs_check1 == FALSE)) {
  cat(acs_vars[which(acs_check1 == FALSE)], '\n')
  stop('PLEASE REVIEW ACS VARIABLE IDS or add underscore before last 3 numbers')
}
# PULL ACS DATA
############################################
acs_raw <- tidycensus::get_acs(
survey='acs5',
variables=acs_vars,
geography='tract',
state=c('VA'),
year=2017, # 5-year estimates ending in 2017, to match the NaNDA pharmacy data
key=Sys.getenv('census_key'),
) %>%
rename_all(str_to_lower) %>%
select(-moe) %>%
group_by(geoid, name) %>%
pivot_wider(
values_from = "estimate",
names_from = "variable"
) %>%
mutate(
drive_pc = drive/total_mobile,
public_trans_pc = public_trans/total_mobile,
walked_pc = walked/total_mobile,
work_home_pc = work_home/total_mobile,
non_hisp_black_pc = non_hisp_black/tot_pop,
non_hisp_white_pc = non_hisp_white/tot_pop,
non_hisp_asian_pc = non_hisp_asian/tot_pop,
hisp_pc = hisp/tot_pop,
below_hs_pc = below_hs/edu_tot,
college_pc = college/edu_tot,
age_over_65_pc = age_over_65/population,
age_65_under_pc = 1-age_over_65_pc,
unemployment_pc = unemployment/une_tot
) %>%
ungroup()
In this section we learned foundational R syntax and used popular packages to import and wrangle data. We sourced data from the ACS website and now have a template for creating reproducible reports with R Markdown.
If you’d like to learn more about what we talked about during this session here are a few resources that go into more detail.
In this section we will discuss a work flow and common packages used for exploratory data analysis (EDA).
There are a few packages essential for EDA, namely: ggplot2, arsenal, patchwork, GGally, sf, and tigris.
As we have seen, ggplot2 handles almost all visualization needs, and its themes can be set at the beginning of the document for consistent visualization. The patchwork package allows plots to be arranged in a grid-like layout.
For easy creation of tables that include descriptive statistics and hypothesis test results, the arsenal package handles that and more.
For quick bivariate comparisons, GGally creates a matrix of both visualizations and tests.
Lastly, sf and tigris will be used to create our geospatial plots. Less common in EDA, but very popular in spatial analysis, are spdep and rgeoda.
Here is an example of an EDA conducted using the variables relevant to our research question. First, let's create a table of our variables of interest using the arsenal package.
Let’s define our controls for our table.
my_controls <- arsenal::tableby.control(
test = T,
total = T,
numeric.stats = c("N", "meansd", "medianq1q3", "range", "Nmiss2"),
cat.stats = c("countpct", "Nmiss2"),
stats.labels = list(
meansd = "Mean (SD)",
medianq1q3 = "Median (Q1, Q3)",
range = "Min - Max",
Nmiss2 = "Missing"
)
)
tableby(~., data = acs_raw %>%
select(drive_pc:hisp_pc)) %>%
summary(text = TRUE)%>%
knitr::kable() %>%
kableExtra::kable_styling() %>%
kableExtra::scroll_box(width = "100%", height = "200px")
Overall (N=1907):

| Variable | Missing | Mean (SD) | Min - Max |
|---|---|---|---|
| drive_pc | 32 | 0.868 (0.109) | 0.280 - 1.000 |
| public_trans_pc | 32 | 0.042 (0.071) | 0.000 - 0.556 |
| walked_pc | 32 | 0.026 (0.052) | 0.000 - 0.548 |
| work_home_pc | 32 | 0.047 (0.034) | 0.000 - 0.439 |
| non_hisp_black_pc | 28 | 0.201 (0.214) | 0.000 - 0.973 |
| non_hisp_white_pc | 28 | 0.627 (0.241) | 0.000 - 1.000 |
| non_hisp_asian_pc | 28 | 0.056 (0.079) | 0.000 - 0.613 |
| hisp_pc | 28 | 0.083 (0.096) | 0.000 - 0.766 |
Let’s create visualizations.
What is the distribution of mobility in VA?
acs_raw %>%
ggplot()+
geom_histogram(aes(x = drive_pc, fill = "Drive"),alpha = 0.3)+
geom_histogram(aes(x = walked_pc, fill = "Walk"), alpha = 0.3)+
geom_histogram(aes(x = public_trans_pc, fill = "Public Transport"), alpha = 0.3)+
labs(x = "proportion of commuters in tract", y = "number of census tracts")+
scale_fill_brewer(palette = "Set3")
How does the distribution of race ethnicity correlate with mobility?
cor.test(acs_raw$drive_pc, acs_raw$non_hisp_black_pc, method = "spearman")
##
## Spearman's rank correlation rho
##
## data: acs_raw$drive_pc and acs_raw$non_hisp_black_pc
## S = 1035548155, p-value = 0.01289
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.05742079
acs_raw %>%
ggplot()+
geom_point(aes(x = drive_pc, y = non_hisp_black_pc))+
acs_raw %>%
ggplot()+
geom_point(aes(x = public_trans_pc, y = non_hisp_black_pc))
Or we could use GGally to look at many comparisons at once for our bivariate analysis. In the scatter plots, each dot represents a census tract.
acs_raw %>%
select(non_hisp_white_pc, non_hisp_black_pc, non_hisp_asian_pc, hisp_pc, drive_pc) %>%
GGally::ggpairs()
How does this look geospatially?
invisible(capture.output({
tracts = tigris::tracts(state = "VA", year = 2017, cb = T)
}))
va_acs = tigris::geo_join(
tracts,
acs_raw,
by_sp = "GEOID",
by_df = "geoid",
how = "inner"
)
va_acs %>%
ggplot()+
geom_sf(aes(fill = non_hisp_black_pc ))+
scale_fill_viridis_b()+
va_acs %>%
ggplot()+
geom_sf(aes(fill = drive_pc ))+
scale_fill_viridis_b()
How does pharmacy access relate to race-ethnicity?
pharm_data = read_csv("data/nanda_healthcare_tract_2003-2017_03P.csv") %>%
filter(year == 2017) %>%
select(count_446110, tract_fips10, popden_446110)
virginia_pharm = tigris::geo_join(
va_acs,
pharm_data ,
by_sp = "GEOID",
by_df = "tract_fips10",
how = "inner"
)
virginia_pharm %>%
sf::st_drop_geometry() %>%
ggplot()+
geom_histogram(aes(x=count_446110 ), fill = "red", alpha = 0.3)
virginia_pharm %>%
sf::st_drop_geometry() %>%
ggplot()+
geom_histogram(aes(x=popden_446110 ), fill = "red", alpha = 0.3)
virginia_pharm %>%
sf::st_drop_geometry() %>%
select(non_hisp_white_pc, non_hisp_black_pc, non_hisp_asian_pc, hisp_pc, popden_446110) %>%
GGally::ggpairs()
Geospatially?
virginia_pharm %>%
ggplot()+
geom_sf(aes(fill = count_446110 ))+
scale_fill_viridis_b()
virginia_pharm %>%
ggplot()+
geom_sf(aes(fill = count_446110 %>% log() ))+
scale_fill_viridis_b(name = "Log Pharmacy Count")
virginia_pharm %>%
ggplot()+
geom_sf(aes(fill = count_446110 %>% log() ))+
scale_fill_viridis_b(name = "Log Pharmacy Count") +
virginia_pharm %>%
ggplot()+
geom_sf(aes(fill = non_hisp_black_pc ))+
scale_fill_viridis_b()
virginia_pharm %>%
ggplot()+
geom_sf(aes(fill = count_446110 %>% log() ))+
scale_fill_viridis_b(name = "Log Pharmacy Count") +
virginia_pharm %>%
ggplot()+
geom_sf(aes(fill = non_hisp_white_pc ))+
scale_fill_viridis_b()
It turns out there are lots of zeros in the data, so we only get three categories. This is largely exploratory, and the p-values are not necessarily comparable given that we have not accounted for multiple comparisons. We are using a non-parametric test to assess the median difference in racial/ethnic population percentages across neighborhood pharmacy count levels.
tableby(pharm_levels ~ count_446110+non_hisp_black_pc + non_hisp_white_pc + non_hisp_asian_pc+ hisp_pc, data = virginia_pharm %>%
sf::st_drop_geometry() %>%
mutate(pharm_levels = case_when(
count_446110 <= quantile(count_446110, 0.25) ~ 1,
count_446110 > quantile(count_446110, 0.25) & count_446110 <= quantile(count_446110, 0.50)~ 2,
count_446110 > quantile(count_446110, 0.50) & count_446110 <= quantile(count_446110, 0.75) ~ 3,
count_446110 > quantile(count_446110, 0.75) ~ 4
)), control = my_controls)%>%
summary(text = TRUE)%>%
knitr::kable() %>%
kableExtra::kable_styling() %>%
kableExtra::scroll_box(width = "100%", height = "200px")
| | 1 (N=1083) | 3 (N=429) | 4 (N=388) | Total (N=1900) | p value |
|---|---|---|---|---|---|
| count_446110 | | | | | < 0.001 |
| N | 1083 | 429 | 388 | 1900 | |
| Mean (SD) | 0.000 (0.000) | 1.000 (0.000) | 2.554 (0.895) | 0.747 (1.078) | |
| Median (Q1, Q3) | 0.000 (0.000, 0.000) | 1.000 (1.000, 1.000) | 2.000 (2.000, 3.000) | 0.000 (0.000, 1.000) | |
| Min - Max | 0.000 - 0.000 | 1.000 - 1.000 | 2.000 - 7.000 | 0.000 - 7.000 | |
| Missing | 0 | 0 | 0 | 0 | |
| non_hisp_black_pc | | | | | 0.005 |
| N | 1062 | 429 | 388 | 1879 | |
| Mean (SD) | 0.214 (0.232) | 0.193 (0.195) | 0.174 (0.177) | 0.201 (0.214) | |
| Median (Q1, Q3) | 0.133 (0.046, 0.286) | 0.127 (0.057, 0.254) | 0.108 (0.048, 0.241) | 0.127 (0.049, 0.272) | |
| Min - Max | 0.000 - 0.973 | 0.000 - 0.917 | 0.000 - 0.915 | 0.000 - 0.973 | |
| Missing | 21 | 0 | 0 | 21 | |
| non_hisp_white_pc | | | | | 0.008 |
| N | 1062 | 429 | 388 | 1879 | |
| Mean (SD) | 0.613 (0.252) | 0.631 (0.234) | 0.657 (0.216) | 0.627 (0.241) | |
| Median (Q1, Q3) | 0.651 (0.447, 0.820) | 0.670 (0.490, 0.828) | 0.686 (0.514, 0.830) | 0.664 (0.469, 0.822) | |
| Min - Max | 0.000 - 1.000 | 0.043 - 0.995 | 0.033 - 0.999 | 0.000 - 1.000 | |
| Missing | 21 | 0 | 0 | 21 | |
| non_hisp_asian_pc | | | | | 0.959 |
| N | 1062 | 429 | 388 | 1879 | |
| Mean (SD) | 0.056 (0.081) | 0.057 (0.076) | 0.055 (0.078) | 0.056 (0.079) | |
| Median (Q1, Q3) | 0.022 (0.003, 0.076) | 0.027 (0.005, 0.084) | 0.026 (0.006, 0.074) | 0.024 (0.004, 0.077) | |
| Min - Max | 0.000 - 0.543 | 0.000 - 0.504 | 0.000 - 0.613 | 0.000 - 0.613 | |
| Missing | 21 | 0 | 0 | 21 | |
| hisp_pc | | | | | 0.873 |
| N | 1062 | 429 | 388 | 1879 | |
| Mean (SD) | 0.082 (0.094) | 0.085 (0.100) | 0.082 (0.097) | 0.083 (0.096) | |
| Median (Q1, Q3) | 0.052 (0.021, 0.107) | 0.052 (0.023, 0.106) | 0.052 (0.023, 0.103) | 0.052 (0.022, 0.106) | |
| Min - Max | 0.000 - 0.699 | 0.000 - 0.600 | 0.000 - 0.766 | 0.000 - 0.766 | |
| Missing | 21 | 0 | 0 | 21 | |
During Day 2's session we will cover how one might handle such findings within the model.
Testing the association between pharmacy count and racial/ethnic groups. Count data typically follow a Poisson distribution, so they are modeled as such. Let \(Y_i \sim Poi(\lambda_i)\) be a random variable representing the number of pharmacies in census tract \(i \in 1...n\). The rate is denoted as \(\lambda_i\), and the average pharmacy rate across all census tracts is represented as \(\sum_i Y_i/n = \lambda\).
# to choose which model is best to account for overdispersion
glm(count_446110 ~1, data = virginia_pharm %>% sf::st_drop_geometry(), family = poisson()) %>%
summary()
##
## Call:
## glm(formula = count_446110 ~ 1, family = poisson(), data = virginia_pharm %>%
## sf::st_drop_geometry())
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2226 -1.2226 -1.2226 0.2777 4.3375
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.29120 0.02654 -10.97 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 2789.6 on 1899 degrees of freedom
## Residual deviance: 2789.6 on 1899 degrees of freedom
## AIC: 4735.6
##
## Number of Fisher Scoring iterations: 6
A strong consideration when modeling count data is dispersion. Specifically, overdispersion occurs when the variance of the data is larger than the mean. There are formal tests for determining whether count data are overdispersed; however, we can also compare the AIC, a measure of a model's goodness of fit, between a Poisson and a negative binomial model. The negative binomial model includes an additional parameter to adjust for overdispersion.
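Before comparing AICs, a quick informal check is to compare the sample mean and variance of the counts directly (a sketch using the virginia_pharm object built above):

```r
counts <- virginia_pharm %>%
  sf::st_drop_geometry() %>%
  dplyr::pull(count_446110)

# For a Poisson-distributed variable the mean and variance should be
# roughly equal; a variance well above the mean hints at overdispersion
mean(counts, na.rm = TRUE)
var(counts, na.rm = TRUE)
```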
MASS::glm.nb(count_446110 ~1, data = virginia_pharm %>% sf::st_drop_geometry()) %>%
summary()
##
## Call:
## MASS::glm.nb(formula = count_446110 ~ 1, data = virginia_pharm %>%
## sf::st_drop_geometry(), init.theta = 1.161837478, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0743 -1.0743 -1.0743 0.2123 2.7577
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.29120 0.03402 -8.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(1.1618) family taken to be 1)
##
## Null deviance: 1814.5 on 1899 degrees of freedom
## Residual deviance: 1814.5 on 1899 degrees of freedom
## AIC: 4534.7
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 1.162
## Std. Err.: 0.123
##
## 2 x log-likelihood: -4530.724
Now that we see the negative binomial model fits our pharmacy count data better, we will adjust for social variables. This is mainly exploratory; more statistical care is needed, such as a multiple comparison correction, which would make the p-value threshold for statistical significance stricter given that we are performing multiple tests with the same data.
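One common correction is to adjust the p-values directly with stats::p.adjust; a sketch with hypothetical p-values (Bonferroni is conservative, while Benjamini-Hochberg controls the false discovery rate):

```r
# Hypothetical p-values from several exploratory tests, for illustration
p_raw <- c(black = 0.0006, white = 0.0007, asian = 0.42, hisp = 0.61)

stats::p.adjust(p_raw, method = "bonferroni")
stats::p.adjust(p_raw, method = "BH")
```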
MASS::glm.nb(count_446110 ~non_hisp_black_pc, data = virginia_pharm %>% sf::st_drop_geometry()) %>%
summary()
##
## Call:
## MASS::glm.nb(formula = count_446110 ~ non_hisp_black_pc, data = virginia_pharm %>%
## sf::st_drop_geometry(), init.theta = 1.223436363, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.132 -1.105 -1.005 0.254 2.691
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.17112 0.04603 -3.717 0.000201 ***
## non_hisp_black_pc -0.57604 0.16882 -3.412 0.000644 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(1.2234) family taken to be 1)
##
## Null deviance: 1818.0 on 1878 degrees of freedom
## Residual deviance: 1806.6 on 1877 degrees of freedom
## (21 observations deleted due to missingness)
## AIC: 4500.9
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 1.223
## Std. Err.: 0.133
##
## 2 x log-likelihood: -4494.899
MASS::glm.nb(count_446110 ~non_hisp_white_pc, data = virginia_pharm %>% sf::st_drop_geometry()) %>%
summary()
##
## Call:
## MASS::glm.nb(formula = count_446110 ~ non_hisp_white_pc, data = virginia_pharm %>%
## sf::st_drop_geometry(), init.theta = 1.22385824, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1621 -1.0967 -1.0084 0.2835 2.6372
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.5925 0.0984 -6.021 1.73e-09 ***
## non_hisp_white_pc 0.4883 0.1435 3.402 0.000668 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(1.2239) family taken to be 1)
##
## Null deviance: 1818.2 on 1878 degrees of freedom
## Residual deviance: 1807.0 on 1877 degrees of freedom
## (21 observations deleted due to missingness)
## AIC: 4501.1
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 1.224
## Std. Err.: 0.133
##
## 2 x log-likelihood: -4495.145
What about spatial correlation? Local indicators of spatial association (LISA) are used to determine whether spatial dependence exists in the data. A common indicator is Moran's index.
While the Global Moran's I is a single statistic testing overall spatial randomness, the Local Moran's I (MI) estimates the tendency of each census tract to cluster with, or be a spatial outlier among, its neighbors. The null hypothesis is spatial randomness, and significant p-values suggest the spatial pattern is not random. Positive local MI values indicate clustering of similar (high or low) values, while negative local MI values indicate spatial outliers, where a tract with a low value relative to the mean has neighbors with high values, or vice versa. For every census tract, a 0/1 weight matrix denoted \(w_{ij}\) identifies its neighbors based on a contiguity rule, in our case queen contiguity (where neighbors share a border or a corner point), as shown below:
\[ W_{ij}=\begin{vmatrix} 0 &1&1&1&0\\ 0&1&i&1&0\\ 0&1&1&1&0 \end{vmatrix} \]
The local Moran score for each census tract is denoted as \(I_i\). Standardized values for a census tract, \(z_i\), are compared with the standardized values \(z_j\) of its neighbors. Lastly, \(c\) is a scaling constant, the reciprocal of the sum of the squared standardized values:
\[z_i = \frac{x_i - \mu}{\sigma} \;,\; z_j = \frac{x_j-\mu}{\sigma} \;\; \text{and} \;\; c = \left(\sum_i z_i^{2}\right)^{-1}\]
\[ I_i = c.z_i\sum_jw_{ij}z_j \]
spdep::moran.test(virginia_pharm$count_446110,
spdep::nb2listw(
spdep::include.self(
spdep::poly2nb(
virginia_pharm, queen = T))) ,
alternative = "greater")
##
## Moran I test under randomisation
##
## data: virginia_pharm$count_446110
## weights: spdep::nb2listw(spdep::include.self(spdep::poly2nb(virginia_pharm,
## queen = T)))
##
## Moran I statistic standard deviate = 11.733, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.1464311483 -0.0005265929 0.0001568835
qw <- rgeoda::queen_weights(virginia_pharm)
moran <- rgeoda::local_moran(qw, virginia_pharm["count_446110"], cpu_threads = 4)
plot(sf::st_geometry(virginia_pharm),
col=sapply(rgeoda::lisa_clusters(moran, cutoff = 0.05), function(x){return(rgeoda::lisa_colors(moran)[[x+1]])}),
border = "#333333", lwd=0.2)
title(main = "Local Moran Map of Pharmacy Count")
legend('bottomleft', legend = rgeoda::lisa_labels(moran),
fill = rgeoda::lisa_colors(moran), border = "#eeeeee")
Given that there is evidence of spatial correlation, additional statistical care is needed to test associations while adjusting for spatial dependence. This will be covered on Day 2 during our special topics section.
In this section we learned how to do exploratory data analysis using popular packages and now have a more robust template for future working documents.
If you’d like to learn more about what we talked about during this session here are a few resources that go into more detail.
Geospatially modeling pharmacy access adjusting for other census tract level social variables
Research question: Measuring pharmacy access and assessing its association with other neighborhood-level variables.
Access is a term that refers to the multiple dimensions of how one fits with a health system, as defined by Penchansky and Thomas in 1981.
Accessibility speaks to one's proximity to an amenity (contiguity),
Availability to volume (adjusting for population),
Affordability to health insurance and income (thoughts?),
Acceptability to social norms when providing services (thoughts?), and
Accommodation to the appropriateness of utilizing a service (thoughts?).
The standardized incidence ratio (SIR) compares the observed counts with the expected. A ratio above 1 suggests the rate of pharmacies is higher than expected, while a value below 1 suggests it is lower than expected. To calculate the expected count, adjusted by population, the following formula is used:
\[E_i = \frac{total_{count}}{total_{pop}}*pop_i\]
acs_pharm_data = virginia_pharm %>%
rename("pharmacy_count" = "count_446110") %>%
mutate(total_population = sum(population, na.rm = T),
total_pharm = sum(pharmacy_count, na.rm = T),
E = population*total_pharm/total_population,
SIR = pharmacy_count/E)
acs_pharm_data %>%
sf::st_drop_geometry() %>%
ggplot() +
geom_histogram(aes(x = pharmacy_count), fill = "red", alpha = 0.3) +
geom_histogram(aes(x = E), fill = "blue", alpha = 0.3)
acs_pharm_data %>%
sf::st_drop_geometry() %>%
ggplot() +
geom_histogram(aes(x = SIR), fill = "green", alpha = 0.3)
Geospatially, we can visualize the areas that have lower than expected pharmacy counts in purple, with brighter colors representing pharmacy counts that are higher than expected.
acs_pharm_data %>%
ggplot() +
geom_sf(aes(fill = pharmacy_count %>% log())) +
scale_fill_viridis_b(name = "Log Pharmacy Count")
acs_pharm_data %>%
ggplot() +
geom_sf(aes(fill = SIR %>% log())) +
scale_fill_viridis_b(name = "Log SIR")
Let \(Y_i \sim Poisson(\eta_i)\) be the distribution of the number of pharmacies in a given census tract without considering spatial dependence, where \(\eta_i\) is the rate for census tract \(i\). For the Bayesian framework we reparameterize \(\eta_i = E_i \theta_i\), where \(E_i\) is the expected pharmacy count based on population and \(\theta_i\) is the parameter of interest estimating the relative rate of pharmacies.
\[log(\theta) = \beta_0+\beta_{1p} X+b\]
where \(\beta_0\) is the model intercept, \(\beta_{1p}\) are the coefficients for the \(p\) covariates in the design matrix \(X\), and \(b\) is the spatial random effect following the BYM2 (Besag, York, and Mollié) specification:
\[\begin{aligned} b &= \frac{1}{\sqrt{\tau_b}}\left(\sqrt{\rho}\,\mu+\sqrt{1-\rho}\,v\right) \\ \mu_i \mid \mu_{j \ne i} &\sim N\!\left( \frac{\sum_j W_{ij}\mu_j}{n_i},\ \frac{1}{n_i}\right), \quad n_i = \text{number of neighbors of tract } i,\\ v_i &\sim N(0,1) \\ \rho &\in (0,1) \\ \tau_b &\sim PC(U,\alpha) \\ \end{aligned}\]
where the spatial component \(\mu\) and the non-spatial component \(v\) are each scaled to have variance \(1\).
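A minimal sketch (toy values, not fit to the data) of how the BYM2 random effect mixes the two components: \(\rho\) controls the share of variance attributed to spatial structure, and \(\tau_b\) scales the overall magnitude.

```r
# Toy illustration of the BYM2 mixing formula. Here mu stands in for
# the scaled ICAR component and v for the iid component; both are
# placeholder N(0, 1) draws rather than draws from the real model.
set.seed(42)
n <- 6
mu <- rnorm(n)   # spatially structured component (placeholder)
v <- rnorm(n)    # unstructured component
rho <- 0.6       # proportion of variance that is spatial
tau_b <- 2       # marginal precision of b
b <- (sqrt(rho) * mu + sqrt(1 - rho) * v) / sqrt(tau_b)
```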
I started by confirming that spatial correlation was present, comparing a model with no spatial random effect to one that includes a spatial random effect.
# Creating network graph
nb <- spdep::poly2nb(acs_pharm_data, queen = T )
spdep::nb2INLA("map.adj", nb)
g <- INLA::inla.read.graph(filename = "map.adj")
# Creating spatial IDs
acs_pharm_data$re_u <- 1:nrow(acs_pharm_data)
model <- INLA::inla(pharmacy_count ~ 1,
family = "poisson",
E= E,
data = as.data.frame(acs_pharm_data),
control.predictor = list(compute = TRUE
# ,link = 1
),
control.compute = list(dic = TRUE,
cpo = T,
config = T))
modela <- INLA::inla(pharmacy_count ~ 1 + f(re_u, model = "bym2", constr = T,
scale.model = T,
graph = g#,
#hyper = prior
),
family = "poisson",
E= E,
data = as.data.frame(acs_pharm_data),
control.predictor = list(compute = TRUE,
link = 1),
control.compute = list(dic = TRUE,
cpo = T,
config = T))
#saveRDS(model, "inla_model.rds")
#saveRDS(modela, "inla_modela.rds")
model <- readRDS("inla_model.rds")
modela <- readRDS("inla_modela.rds")
#INLA::inla.priors.used(model)
model$dic$dic
## [1] 4603.244
modela$dic$dic
## [1] 4295.739
acs_pharm_data$RR_mod1 <- modela$summary.fitted.values[, "mean"]
acs_pharm_data$RR_mod1_HH <- modela$summary.fitted.values[, "0.975quant"]
acs_pharm_data$RR_mod1_LL <- modela$summary.fitted.values[, "0.025quant"]
ggplot(sf::st_as_sf(acs_pharm_data)) +
geom_sf(aes(fill = RR_mod1 %>% log()), size = 0.01) +
scale_fill_gradient2(midpoint = 0, low = "blue", mid = "white", high = "red", name = "", na.value = "#FFFFFF") +
theme(axis.text = element_blank(),
legend.position = c(0.2, 0.8))

sf::st_as_sf(acs_pharm_data %>%
mutate(ind = case_when(RR_mod1_HH > 1 & RR_mod1_LL > 1 ~ 2,
RR_mod1_HH < 1 & RR_mod1_LL < 1 ~ 1,
RR_mod1_HH > 1 & RR_mod1_LL < 1 ~ 0),
RR_mod1 = ifelse(ind == 0, NA, RR_mod1))) %>%
ggplot() +
geom_sf(aes(fill = RR_mod1 %>% log()), size = 0.01) +
scale_fill_gradient2(midpoint = 0, low = "blue", mid = "white", high = "red", name = "", na.value = "#FFFFFF") +
theme(axis.text = element_blank(),
legend.position = c(0.2, 0.8))
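The `mutate()` above classifies each tract by whether its 95% credible interval excludes 1. That rule can be isolated as a small helper (`classify_rr` is a hypothetical name, not part of any package):

```r
library(dplyr)

# Hypothetical helper mirroring the case_when() rule used above:
# a tract is flagged 2 ("high") when its credible interval sits
# entirely above 1, 1 ("low") when entirely below, else 0 (inconclusive).
classify_rr <- function(lower, upper) {
  case_when(lower > 1 & upper > 1 ~ 2,
            lower < 1 & upper < 1 ~ 1,
            TRUE ~ 0)
}

classify_rr(lower = c(1.2, 0.4, 0.8), upper = c(2.0, 0.9, 1.5))
# → 2 1 0
```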
The lower DIC value for the model adjusting for spatial correlation suggests that it fits better. Next, we will look at the distribution of our social variables with respect to pharmacy access levels.
Table of variable distribution by average and high pharmacy access neighborhoods.
tableby(ind~pharmacy_count + population+ drive_pc +
public_trans_pc +
walked_pc +
work_home_pc +
non_hisp_black_pc +
non_hisp_white_pc +
non_hisp_asian_pc +
hisp_pc +
below_hs_pc +
college_pc +
age_over_65_pc +
age_65_under_pc +
unemployment_pc , data = acs_pharm_data %>%
mutate(ind = case_when(RR_mod1_HH > 1 & RR_mod1_LL> 1 ~ 2,
RR_mod1_HH < 1 & RR_mod1_LL< 1 ~ 1,
RR_mod1_HH > 1 & RR_mod1_LL< 1 ~ 0),
RR_mod1 = ifelse(ind == 0, NA,RR_mod1)), control = my_controls_row ) %>% summary(text = TRUE)%>%
knitr::kable() %>%
kableExtra::kable_styling() %>%
kableExtra::scroll_box(width = "100%", height = "200px")
| | 0 (N=1875) | 2 (N=25) | Total (N=1900) | p value |
|---|---|---|---|---|
| pharmacy_count | | | | < 0.001 |
| N | 1875 | 25 | 1900 | |
| Mean (SD) | 0.691 (0.962) | 5.000 (0.816) | 0.747 (1.078) | |
| Median (Q1, Q3) | 0.000 (0.000, 1.000) | 5.000 (4.000, 5.000) | 0.000 (0.000, 1.000) | |
| Min - Max | 0.000 - 5.000 | 4.000 - 7.000 | 0.000 - 7.000 | |
| Missing | 0 | 0 | 0 | |
| population | | | | 0.450 |
| N | 1875 | 25 | 1900 | |
| Mean (SD) | 4406.850 (1941.796) | 4124.320 (1533.838) | 4403.133 (1936.932) | |
| Median (Q1, Q3) | 4186.000 (3038.000, 5562.500) | 3862.000 (3094.000, 4865.000) | 4183.000 (3043.000, 5555.750) | |
| Min - Max | 0.000 - 15271.000 | 1631.000 - 8072.000 | 0.000 - 15271.000 | |
| Missing | 0 | 0 | 0 | |
| drive_pc | | | | 0.690 |
| N | 1850 | 25 | 1875 | |
| Mean (SD) | 0.868 (0.109) | 0.833 (0.148) | 0.868 (0.109) | |
| Median (Q1, Q3) | 0.900 (0.836, 0.936) | 0.910 (0.728, 0.936) | 0.900 (0.836, 0.936) | |
| Min - Max | 0.280 - 1.000 | 0.480 - 0.977 | 0.280 - 1.000 | |
| Missing | 25 | 0 | 25 | |
| public_trans_pc | | | | 0.305 |
| N | 1850 | 25 | 1875 | |
| Mean (SD) | 0.042 (0.071) | 0.040 (0.073) | 0.042 (0.071) | |
| Median (Q1, Q3) | 0.013 (0.000, 0.052) | 0.006 (0.000, 0.011) | 0.013 (0.000, 0.052) | |
| Min - Max | 0.000 - 0.556 | 0.000 - 0.247 | 0.000 - 0.556 | |
| Missing | 25 | 0 | 25 | |
| walked_pc | | | | 0.016 |
| N | 1850 | 25 | 1875 | |
| Mean (SD) | 0.025 (0.051) | 0.057 (0.092) | 0.026 (0.052) | |
| Median (Q1, Q3) | 0.010 (0.001, 0.026) | 0.023 (0.005, 0.045) | 0.010 (0.001, 0.027) | |
| Min - Max | 0.000 - 0.548 | 0.000 - 0.362 | 0.000 - 0.548 | |
| Missing | 25 | 0 | 25 | |
| work_home_pc | | | | 0.324 |
| N | 1850 | 25 | 1875 | |
| Mean (SD) | 0.047 (0.034) | 0.051 (0.030) | 0.047 (0.034) | |
| Median (Q1, Q3) | 0.041 (0.022, 0.064) | 0.045 (0.031, 0.069) | 0.041 (0.022, 0.064) | |
| Min - Max | 0.000 - 0.439 | 0.000 - 0.119 | 0.000 - 0.439 | |
| Missing | 25 | 0 | 25 | |
| non_hisp_black_pc | | | | 0.099 |
| N | 1854 | 25 | 1879 | |
| Mean (SD) | 0.202 (0.215) | 0.116 (0.102) | 0.201 (0.214) | |
| Median (Q1, Q3) | 0.128 (0.050, 0.272) | 0.087 (0.046, 0.162) | 0.127 (0.049, 0.272) | |
| Min - Max | 0.000 - 0.973 | 0.004 - 0.420 | 0.000 - 0.973 | |
| Missing | 21 | 0 | 21 | |
| non_hisp_white_pc | | | | 0.011 |
| N | 1854 | 25 | 1879 | |
| Mean (SD) | 0.625 (0.242) | 0.751 (0.153) | 0.627 (0.241) | |
| Median (Q1, Q3) | 0.662 (0.466, 0.821) | 0.764 (0.703, 0.846) | 0.664 (0.469, 0.822) | |
| Min - Max | 0.000 - 1.000 | 0.410 - 0.984 | 0.000 - 1.000 | |
| Missing | 21 | 0 | 21 | |
| non_hisp_asian_pc | | | | 0.931 |
| N | 1854 | 25 | 1879 | |
| Mean (SD) | 0.056 (0.080) | 0.048 (0.065) | 0.056 (0.079) | |
| Median (Q1, Q3) | 0.024 (0.004, 0.077) | 0.025 (0.007, 0.079) | 0.024 (0.004, 0.077) | |
| Min - Max | 0.000 - 0.613 | 0.000 - 0.252 | 0.000 - 0.613 | |
| Missing | 21 | 0 | 21 | |
| hisp_pc | | | | 0.401 |
| N | 1854 | 25 | 1879 | |
| Mean (SD) | 0.083 (0.097) | 0.056 (0.045) | 0.083 (0.096) | |
| Median (Q1, Q3) | 0.052 (0.022, 0.106) | 0.041 (0.027, 0.066) | 0.052 (0.022, 0.106) | |
| Min - Max | 0.000 - 0.766 | 0.000 - 0.170 | 0.000 - 0.766 | |
| Missing | 21 | 0 | 21 | |
| below_hs_pc | | | | 0.738 |
| N | 1854 | 25 | 1879 | |
| Mean (SD) | 0.486 (0.056) | 0.493 (0.065) | 0.486 (0.056) | |
| Median (Q1, Q3) | 0.484 (0.461, 0.506) | 0.485 (0.453, 0.515) | 0.484 (0.461, 0.506) | |
| Min - Max | 0.000 - 1.000 | 0.375 - 0.649 | 0.000 - 1.000 | |
| Missing | 21 | 0 | 21 | |
| college_pc | | | | 0.136 |
| N | 1854 | 25 | 1879 | |
| Mean (SD) | 0.259 (0.068) | 0.245 (0.061) | 0.259 (0.067) | |
| Median (Q1, Q3) | 0.273 (0.223, 0.307) | 0.253 (0.196, 0.292) | 0.273 (0.223, 0.307) | |
| Min - Max | 0.000 - 0.403 | 0.100 - 0.363 | 0.000 - 0.403 | |
| Missing | 21 | 0 | 21 | |
| age_over_65_pc | | | | 0.008 |
| N | 1854 | 25 | 1879 | |
| Mean (SD) | 0.031 (0.015) | 0.025 (0.019) | 0.031 (0.015) | |
| Median (Q1, Q3) | 0.030 (0.021, 0.041) | 0.019 (0.014, 0.029) | 0.030 (0.021, 0.041) | |
| Min - Max | 0.000 - 0.091 | 0.000 - 0.069 | 0.000 - 0.091 | |
| Missing | 21 | 0 | 21 | |
| age_65_under_pc | | | | 0.008 |
| N | 1854 | 25 | 1879 | |
| Mean (SD) | 0.969 (0.015) | 0.975 (0.019) | 0.969 (0.015) | |
| Median (Q1, Q3) | 0.970 (0.959, 0.979) | 0.981 (0.971, 0.986) | 0.970 (0.959, 0.979) | |
| Min - Max | 0.909 - 1.000 | 0.931 - 1.000 | 0.909 - 1.000 | |
| Missing | 21 | 0 | 21 | |
| unemployment_pc | | | | 0.216 |
| N | 1850 | 25 | 1875 | |
| Mean (SD) | 0.058 (0.039) | 0.065 (0.037) | 0.058 (0.039) | |
| Median (Q1, Q3) | 0.050 (0.032, 0.074) | 0.063 (0.036, 0.082) | 0.050 (0.032, 0.074) | |
| Min - Max | 0.000 - 0.327 | 0.016 - 0.166 | 0.000 - 0.327 | |
| Missing | 25 | 0 | 25 | |
model3 <- INLA::inla(pharmacy_count ~ below_hs_pc + age_over_65_pc + non_hisp_black_pc +
non_hisp_asian_pc +
hisp_pc + f(re_u, model = "bym2", constr = T,
scale.model = T,
graph = g#,
#hyper = prior
),
family = "poisson",
E= E,
data = as.data.frame(acs_pharm_data),
control.predictor = list(compute = TRUE,
link = 1),
control.compute = list(dic = TRUE,
cpo = T,
config = T))
#saveRDS(model3, "inla_model3.rds")
model3 <- readRDS("inla_model3.rds")
model3$summary.fixed %>%
knitr::kable() %>%
kableExtra::kable_styling() %>%
kableExtra::scroll_box(width = "100%", height = "200px")
| mean | sd | 0.025quant | 0.5quant | 0.975quant | mode | kld | |
|---|---|---|---|---|---|---|---|
| (Intercept) | 1.7528588 | 0.3700598 | 1.0262775 | 1.7531160 | 2.4779909 | 1.7536354 | 0 |
| below_hs_pc | -2.9017122 | 0.7214740 | -4.3182868 | -2.9012041 | -1.4880100 | -2.9001867 | 0 |
| age_over_65_pc | -15.9760566 | 2.3384438 | -20.5650973 | -15.9752735 | -11.3914527 | -15.9737139 | 0 |
| non_hisp_black_pc | -0.5104305 | 0.1963984 | -0.8956691 | -0.5104553 | -0.1250489 | -0.5105019 | 0 |
| non_hisp_asian_pc | -0.0940878 | 0.5344479 | -1.1338354 | -0.0971959 | 0.9631976 | -0.1034178 | 0 |
| hisp_pc | 0.0484742 | 0.4052122 | -0.7430814 | 0.0473430 | 0.8464207 | 0.0450768 | 0 |
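Since the linear predictor is on the log scale, exponentiating the posterior means above gives rate ratios; for example, the `non_hisp_black_pc` coefficient of about -0.51 corresponds to a rate ratio of roughly exp(-0.51) ≈ 0.60. A small sketch, with the values copied from the fixed-effects table above:

```r
# Converting posterior means from the log (linear predictor) scale to
# the rate-ratio scale; values copied from the fixed-effects table.
post_mean <- c(below_hs_pc       = -2.9017122,
               age_over_65_pc    = -15.9760566,
               non_hisp_black_pc = -0.5104305,
               non_hisp_asian_pc = -0.0940878,
               hisp_pc           = 0.0484742)
round(exp(post_mean), 3)
```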
acs_pharm_data$RR_mod3 <- model3$summary.fitted.values[, "mean"]
acs_pharm_data$RR_mod3_HH <- model3$summary.fitted.values[, "0.975quant"]
acs_pharm_data$RR_mod3_LL <- model3$summary.fitted.values[, "0.025quant"]
ggplot(sf::st_as_sf(acs_pharm_data)) +
geom_sf(aes(fill = RR_mod3 %>% log()), size = 0.01) +
scale_fill_gradient2(midpoint = 0, low = "blue", mid = "white", high = "red", name = "", na.value = "#FFFFFF") +
theme(axis.text = element_blank(),
legend.position = c(0.2, 0.8))

sf::st_as_sf(acs_pharm_data %>%
mutate(ind = case_when(RR_mod3_HH > 1 & RR_mod3_LL > 1 ~ 2,
RR_mod3_HH < 1 & RR_mod3_LL < 1 ~ 1,
RR_mod3_HH > 1 & RR_mod3_LL < 1 ~ 0),
RR_mod3 = ifelse(ind == 0, NA, RR_mod3))) %>%
ggplot() +
geom_sf(aes(fill = RR_mod3 %>% log()), size = 0.01) +
scale_fill_gradient2(midpoint = 0, low = "blue", mid = "white", high = "red", name = "", na.value = "#FFFFFF") +
theme(axis.text = element_blank(),
legend.position = c(0.2, 0.8))
tableby(ind ~pharmacy_count+ E+ RR_mod3+population+ drive_pc +
public_trans_pc +
walked_pc +
work_home_pc +
non_hisp_black_pc +
non_hisp_white_pc +
non_hisp_asian_pc +
hisp_pc +
below_hs_pc +
college_pc +
age_over_65_pc +
age_65_under_pc +
unemployment_pc, data = acs_pharm_data %>%
mutate(ind = case_when(RR_mod3_HH > 1 & RR_mod3_LL > 1 ~ "high",
RR_mod3_HH < 1 & RR_mod3_LL < 1 ~ "low",
RR_mod3_HH > 1 & RR_mod3_LL < 1 ~ "average")),
control = my_controls_row) %>%
summary(test = FALSE)
| average (N=1849) | high (N=49) | low (N=2) | Total (N=1900) | |
|---|---|---|---|---|
| pharmacy_count | ||||
| N | 1849 | 49 | 2 | 1900 |
| Mean (SD) | 0.699 (0.968) | 2.612 (2.498) | 0.000 (0.000) | 0.747 (1.078) |
| Median (Q1, Q3) | 0.000 (0.000, 1.000) | 3.000 (0.000, 5.000) | 0.000 (0.000, 0.000) | 0.000 (0.000, 1.000) |
| Min - Max | 0.000 - 5.000 | 0.000 - 7.000 | 0.000 - 0.000 | 0.000 - 7.000 |
| Missing | 0 | 0 | 0 | 0 |
| E | ||||
| N | 1849 | 49 | 2 | 1900 |
| Mean (SD) | 0.757 (0.321) | 0.371 (0.394) | 0.887 (0.462) | 0.747 (0.329) |
| Median (Q1, Q3) | 0.718 (0.522, 0.946) | 0.380 (0.000, 0.647) | 0.887 (0.724, 1.051) | 0.710 (0.517, 0.943) |
| Min - Max | 0.001 - 2.592 | 0.000 - 1.370 | 0.561 - 1.214 | 0.000 - 2.592 |
| Missing | 0 | 0 | 0 | 0 |
| RR_mod3 | ||||
| N | 1849 | 49 | 2 | 1900 |
| Mean (SD) | 1.042 (0.493) | 5.749 (2.111) | 0.306 (0.023) | 1.163 (0.952) |
| Median (Q1, Q3) | 0.901 (0.699, 1.265) | 5.369 (3.707, 7.584) | 0.306 (0.298, 0.314) | 0.914 (0.701, 1.300) |
| Min - Max | 0.307 - 3.344 | 2.695 - 9.550 | 0.290 - 0.323 | 0.290 - 9.550 |
| Missing | 0 | 0 | 0 | 0 |
| population | ||||
| N | 1849 | 49 | 2 | 1900 |
| Mean (SD) | 4460.984 (1891.347) | 2186.429 (2318.872) | 5228.500 (2721.654) | 4403.133 (1936.932) |
| Median (Q1, Q3) | 4232.000 (3078.000, 5572.000) | 2237.000 (0.000, 3809.000) | 5228.500 (4266.250, 6190.750) | 4183.000 (3043.000, 5555.750) |
| Min - Max | 8.000 - 15271.000 | 0.000 - 8072.000 | 3304.000 - 7153.000 | 0.000 - 15271.000 |
| Missing | 0 | 0 | 0 | 0 |
| drive_pc | ||||
| N | 1845 | 28 | 2 | 1875 |
| Mean (SD) | 0.868 (0.108) | 0.816 (0.167) | 0.906 (0.007) | 0.868 (0.109) |
| Median (Q1, Q3) | 0.900 (0.837, 0.936) | 0.906 (0.690, 0.938) | 0.906 (0.904, 0.909) | 0.900 (0.836, 0.936) |
| Min - Max | 0.280 - 1.000 | 0.422 - 1.000 | 0.901 - 0.912 | 0.280 - 1.000 |
| Missing | 4 | 21 | 0 | 25 |
| public_trans_pc | ||||
| N | 1845 | 28 | 2 | 1875 |
| Mean (SD) | 0.042 (0.071) | 0.055 (0.097) | 0.053 (0.021) | 0.042 (0.071) |
| Median (Q1, Q3) | 0.013 (0.000, 0.052) | 0.010 (0.000, 0.053) | 0.053 (0.046, 0.061) | 0.013 (0.000, 0.052) |
| Min - Max | 0.000 - 0.556 | 0.000 - 0.399 | 0.039 - 0.068 | 0.000 - 0.556 |
| Missing | 4 | 21 | 0 | 25 |
| walked_pc | ||||
| N | 1845 | 28 | 2 | 1875 |
| Mean (SD) | 0.025 (0.051) | 0.062 (0.092) | 0.013 (0.018) | 0.026 (0.052) |
| Median (Q1, Q3) | 0.010 (0.001, 0.026) | 0.023 (0.004, 0.057) | 0.013 (0.006, 0.019) | 0.010 (0.001, 0.027) |
| Min - Max | 0.000 - 0.548 | 0.000 - 0.362 | 0.000 - 0.026 | 0.000 - 0.548 |
| Missing | 4 | 21 | 0 | 25 |
| work_home_pc | ||||
| N | 1845 | 28 | 2 | 1875 |
| Mean (SD) | 0.047 (0.034) | 0.046 (0.027) | 0.025 (0.006) | 0.047 (0.034) |
| Median (Q1, Q3) | 0.041 (0.022, 0.064) | 0.044 (0.030, 0.062) | 0.025 (0.022, 0.027) | 0.041 (0.022, 0.064) |
| Min - Max | 0.000 - 0.439 | 0.000 - 0.110 | 0.020 - 0.029 | 0.000 - 0.439 |
| Missing | 4 | 21 | 0 | 25 |
| non_hisp_black_pc | ||||
| N | 1849 | 28 | 2 | 1879 |
| Mean (SD) | 0.202 (0.215) | 0.115 (0.103) | 0.145 (0.101) | 0.201 (0.214) |
| Median (Q1, Q3) | 0.128 (0.050, 0.272) | 0.088 (0.038, 0.145) | 0.145 (0.109, 0.180) | 0.127 (0.049, 0.272) |
| Min - Max | 0.000 - 0.973 | 0.000 - 0.420 | 0.074 - 0.216 | 0.000 - 0.973 |
| Missing | 0 | 21 | 0 | 21 |
| non_hisp_white_pc | ||||
| N | 1849 | 28 | 2 | 1879 |
| Mean (SD) | 0.625 (0.242) | 0.758 (0.166) | 0.386 (0.249) | 0.627 (0.241) |
| Median (Q1, Q3) | 0.662 (0.466, 0.820) | 0.780 (0.668, 0.870) | 0.386 (0.298, 0.474) | 0.664 (0.469, 0.822) |
| Min - Max | 0.000 - 1.000 | 0.410 - 1.000 | 0.210 - 0.562 | 0.000 - 1.000 |
| Missing | 0 | 21 | 0 | 21 |
| non_hisp_asian_pc | ||||
| N | 1849 | 28 | 2 | 1879 |
| Mean (SD) | 0.056 (0.080) | 0.048 (0.068) | 0.081 (0.071) | 0.056 (0.079) |
| Median (Q1, Q3) | 0.024 (0.004, 0.077) | 0.019 (0.006, 0.069) | 0.081 (0.056, 0.106) | 0.024 (0.004, 0.077) |
| Min - Max | 0.000 - 0.613 | 0.000 - 0.252 | 0.031 - 0.131 | 0.000 - 0.613 |
| Missing | 0 | 21 | 0 | 21 |
| hisp_pc | ||||
| N | 1849 | 28 | 2 | 1879 |
| Mean (SD) | 0.083 (0.096) | 0.051 (0.047) | 0.343 (0.243) | 0.083 (0.096) |
| Median (Q1, Q3) | 0.052 (0.022, 0.106) | 0.038 (0.014, 0.069) | 0.343 (0.257, 0.428) | 0.052 (0.022, 0.106) |
| Min - Max | 0.000 - 0.766 | 0.000 - 0.170 | 0.171 - 0.514 | 0.000 - 0.766 |
| Missing | 0 | 21 | 0 | 21 |
| below_hs_pc | ||||
| N | 1849 | 28 | 2 | 1879 |
| Mean (SD) | 0.487 (0.055) | 0.461 (0.108) | 0.498 (0.002) | 0.486 (0.056) |
| Median (Q1, Q3) | 0.484 (0.461, 0.506) | 0.477 (0.439, 0.507) | 0.498 (0.497, 0.498) | 0.484 (0.461, 0.506) |
| Min - Max | 0.252 - 1.000 | 0.000 - 0.649 | 0.496 - 0.499 | 0.000 - 1.000 |
| Missing | 0 | 21 | 0 | 21 |
| college_pc | ||||
| N | 1849 | 28 | 2 | 1879 |
| Mean (SD) | 0.260 (0.067) | 0.221 (0.079) | 0.332 (0.005) | 0.259 (0.067) |
| Median (Q1, Q3) | 0.274 (0.223, 0.307) | 0.243 (0.193, 0.277) | 0.332 (0.330, 0.333) | 0.273 (0.223, 0.307) |
| Min - Max | 0.000 - 0.403 | 0.000 - 0.327 | 0.328 - 0.335 | 0.000 - 0.403 |
| Missing | 0 | 21 | 0 | 21 |
| age_over_65_pc | ||||
| N | 1849 | 28 | 2 | 1879 |
| Mean (SD) | 0.031 (0.015) | 0.016 (0.012) | 0.085 (0.008) | 0.031 (0.015) |
| Median (Q1, Q3) | 0.030 (0.021, 0.041) | 0.014 (0.009, 0.021) | 0.085 (0.082, 0.088) | 0.030 (0.021, 0.041) |
| Min - Max | 0.000 - 0.079 | 0.000 - 0.057 | 0.079 - 0.091 | 0.000 - 0.091 |
| Missing | 0 | 21 | 0 | 21 |
| age_65_under_pc | ||||
| N | 1849 | 28 | 2 | 1879 |
| Mean (SD) | 0.969 (0.015) | 0.984 (0.012) | 0.915 (0.008) | 0.969 (0.015) |
| Median (Q1, Q3) | 0.970 (0.959, 0.979) | 0.986 (0.979, 0.991) | 0.915 (0.912, 0.918) | 0.970 (0.959, 0.979) |
| Min - Max | 0.921 - 1.000 | 0.943 - 1.000 | 0.909 - 0.921 | 0.909 - 1.000 |
| Missing | 0 | 21 | 0 | 21 |
| unemployment_pc | ||||
| N | 1845 | 28 | 2 | 1875 |
| Mean (SD) | 0.058 (0.039) | 0.059 (0.037) | 0.047 (0.031) | 0.058 (0.039) |
| Median (Q1, Q3) | 0.050 (0.032, 0.074) | 0.052 (0.034, 0.075) | 0.047 (0.036, 0.057) | 0.050 (0.032, 0.074) |
| Min - Max | 0.000 - 0.327 | 0.000 - 0.166 | 0.025 - 0.068 | 0.000 - 0.327 |
| Missing | 4 | 21 | 0 | 25 |